In this practical, a number of R packages are used. If any of them are not installed you can still follow the practical but will not be able to run all of the code. The packages used (with versions that were used to generate the solutions) are:
mice (version: 3.3.0)
visdat (version: 0.5.2)
JointAI (version: 0.4.0)
VIM (version: 4.7.0)
plyr (version: 1.8.4)
corrplot (version: 0.84)
You can find the help file for any function by adding a ? before the name of the function.
Alternatively, you can look up the help pages online at https://www.rdocumentation.org/ or find the whole manual for a package at https://cran.r-project.org/web/packages/available_packages_by_name.html
For this example, we will use the NHANES dataset. To load this dataset, you can use the command file.choose() which opens the explorer and allows you to navigate to the location of the file NHANES_for_practicals.RData on your computer. If you know the path to the file, you can also use load("<path>/NHANES_for_practicals.RData").
Let’s take a first look at the data. Useful functions are dim(), head(), str() and summary().
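The output below was obtained with calls like the following (assuming the dataset has been loaded into an object named NHANES):

```r
dim(NHANES)      # number of rows and columns
str(NHANES)      # class and first few values of each variable
summary(NHANES)  # summary statistics for each variable
```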
## [1] 1000 24
## 'data.frame': 1000 obs. of 24 variables:
## $ occup : Factor w/ 3 levels "working","looking for work",..: 1 1 1 1 3 1 1 1 1 1 ...
## $ BMI : num 27 29.5 29.8 17.5 31.5 ...
## $ HDL : num 1.06 1.29 0.85 1.99 NA 1.71 1.19 NA 1.71 1.32 ...
## $ hypchol : Factor w/ 2 levels "no","yes": 1 2 1 1 NA 1 1 NA 1 1 ...
## $ hypten : Factor w/ 2 levels "no","yes": 2 2 1 1 2 1 1 2 1 1 ...
## $ SBP : num NaN 128.7 114.7 96.7 132 ...
## $ bili : num 0.9 0.9 0.7 1 NA 0.7 0.5 NA 0.7 0.7 ...
## $ educ : Factor w/ 5 levels "Less than 9th grade",..: 3 5 5 5 2 1 2 4 4 3 ...
## $ HyperMed: Ord.factor w/ 3 levels "no"<"previous"<..: 3 NA NA NA 3 NA NA 3 NA NA ...
## $ creat : num 1.05 0.98 0.83 0.72 NA 0.75 0.73 NA 0.89 0.95 ...
## $ cohort : chr "2011" "2011" "2011" "2011" ...
## $ age : num 58 57 56 31 79 60 37 68 23 20 ...
## $ DBP : num NaN 94.7 72 65.3 52 ...
## $ smoke : Ord.factor w/ 3 levels "never"<"former"<..: 1 1 3 1 1 1 1 1 1 1 ...
## $ wgt : num 93 90.7 76.2 52.2 70.8 ...
## $ alc : Ord.factor w/ 5 levels "0"<"<=1"<"1-3"<..: 1 5 1 NA 2 NA NA NA 3 2 ...
## $ WC : num 99.9 111.6 90.1 76.9 100.2 ...
## $ hgt : num 1.85 1.75 1.6 1.73 1.5 ...
## $ race : Factor w/ 5 levels "Mexican American",..: 3 2 5 3 4 5 5 4 1 2 ...
## $ chol : num 4.76 6.83 2.79 5.09 NA 5.59 6.08 NA 4.03 3.31 ...
## $ uricacid: num 6.8 5.9 6.1 3.4 NA 5.1 6.1 NA 7.3 3.5 ...
## $ DM : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ gender : Factor w/ 2 levels "male","female": 1 1 1 2 2 2 2 2 1 1 ...
## $ albu : num 4.7 4 4.2 4.3 NA 4.1 3.7 NA 4.4 4.6 ...
## occup BMI HDL hypchol hypten SBP
## working :564 Min. :15.50 Min. :0.360 no :799 no :717 Min. : 81.33
## looking for work: 44 1st Qu.:22.86 1st Qu.:1.110 yes :113 yes :241 1st Qu.:107.67
## not working :378 Median :26.32 Median :1.340 NA's: 88 NA's: 42 Median :117.33
## NA's : 14 Mean :27.34 Mean :1.385 Mean :119.06
## 3rd Qu.:30.62 3rd Qu.:1.600 3rd Qu.:127.33
## Max. :62.35 Max. :4.030 Max. :202.00
## NA's :35 NA's :88 NA's :61
## bili educ HyperMed creat cohort
## Min. :0.2000 Less than 9th grade : 90 no : 25 Min. :0.4300 Length:1000
## 1st Qu.:0.6000 9-11th grade :133 previous: 14 1st Qu.:0.6800 Class :character
## Median :0.7000 High school graduate:214 yes :144 Median :0.8000 Mode :character
## Mean :0.7479 some college :288 NA's :817 Mean :0.8387
## 3rd Qu.:0.9000 College or above :275 3rd Qu.:0.9500
## Max. :4.1000 Max. :6.1800
## NA's :92 NA's :91
## age DBP smoke wgt alc WC
## Min. :20.0 Min. : 0.00 never :624 Min. : 38.56 0 :117 Min. : 61.90
## 1st Qu.:30.0 1st Qu.: 63.33 former :176 1st Qu.: 63.39 <=1 :293 1st Qu.: 83.10
## Median :42.0 Median : 70.00 current:199 Median : 75.30 1-3 :100 Median : 94.15
## Mean :44.3 Mean : 70.12 NA's : 1 Mean : 77.45 3-7 : 70 Mean : 95.33
## 3rd Qu.:58.0 3rd Qu.: 76.67 3rd Qu.: 88.45 >7 :105 3rd Qu.:104.70
## Max. :79.0 Max. :110.00 Max. :163.29 NA's:315 Max. :170.50
## NA's :61 NA's :16 NA's :60
## hgt race chol uricacid DM gender
## Min. :1.397 Mexican American :123 Min. :2.070 Min. : 1.800 no :861 male :464
## 1st Qu.:1.600 Other Hispanic :119 1st Qu.:4.240 1st Qu.: 4.200 yes:139 female:536
## Median :1.676 Non-Hispanic White:346 Median :4.890 Median : 5.200
## Mean :1.681 Non-Hispanic Black:212 Mean :4.985 Mean : 5.303
## 3rd Qu.:1.753 other :200 3rd Qu.:5.590 3rd Qu.: 6.300
## Max. :1.981 Max. :9.930 Max. :11.200
## NA's :24 NA's :88 NA's :91
## albu
## Min. :2.60
## 1st Qu.:4.10
## Median :4.30
## Mean :4.27
## 3rd Qu.:4.50
## Max. :5.20
## NA's :91
It is important to check that all variables are coded correctly, i.e., have the correct class. Imputation software (e.g., the mice package) uses the class to automatically select imputation methods. When importing data from other software it can happen that factors become continuous variables or that ordered factors lose their ordering.
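A quick way to inspect all classes at once (a sketch, assuming the data object is called NHANES):

```r
# class of each variable in the data
sapply(NHANES, class)
```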
str() showed that in the NHANES data smoke and alc are correctly specified as ordinal variables, but educ is an unordered factor.
Using levels(NHANES$educ) we can print the names of the categories of educ. Convert the unordered factor to an ordered factor, for example using as.ordered(). Afterwards, check if the conversion was successful.
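One possible solution, overwriting educ in place:

```r
# print the category names, then convert educ to an
# ordered factor and check that the conversion worked
levels(NHANES$educ)
NHANES$educ <- as.ordered(NHANES$educ)
str(NHANES$educ)
```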
## [1] "Less than 9th grade" "9-11th grade" "High school graduate" "some college"
## [5] "College or above"
## Ord.factor w/ 5 levels "Less than 9th grade"<..: 3 5 5 5 2 1 2 4 4 3 ...
In the summary() we could already see that there are missing values in several variables. The missing data pattern can be obtained and visualized by several functions from different packages. Examples are
md.pattern() from package mice
md_pattern() from package JointAI (with argument pattern = TRUE)
aggr() from package VIM
vis_dat() and vis_miss() from package visdat
Explore the missing data pattern of the NHANES data.
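These functions can be called, for instance, as follows (each produces a tabular and/or graphical overview of the missing data pattern):

```r
library(mice)
md.pattern(NHANES)                           # pattern matrix, plus a plot
JointAI::md_pattern(NHANES, pattern = TRUE)  # similar matrix, different layout
VIM::aggr(NHANES)                            # proportion of NA per variable + pattern
visdat::vis_miss(NHANES)                     # heatmap of observed/missing values
```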
## educ cohort age race DM gender smoke occup wgt hgt BMI hypten WC SBP DBP HDL hypchol chol creat
## 33 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 534 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 69 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 149 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## 20 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 26 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 9 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 10 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 11 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0
## 9 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 8 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 10 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 7 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0
## 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0
## 5 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0
## 2 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
## 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1
## 12 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1
## 13 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0
## 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
## 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
## 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 7 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 0 0 0 0
## 1 1 1 1 1 1 1 1 1 1 0 0 1 0 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
## 2 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1 0 0 0 0
## 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 0 1 1 1 1
## 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 1 1 1 1
## 3 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1
## 10 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 0 0 0 0
## 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 0 1 1 1 1
## 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 0 0
## 0 0 0 0 0 0 1 14 16 24 35 42 60 61 61 88 88 88 91
## uricacid albu bili alc HyperMed
## 33 1 1 1 1 1 0
## 534 1 1 1 1 0 1
## 69 1 1 1 0 1 1
## 149 1 1 1 0 0 2
## 1 1 1 0 1 0 2
## 2 0 0 0 1 0 5
## 1 0 0 0 0 1 5
## 20 0 0 0 1 1 7
## 26 0 0 0 1 0 8
## 9 0 0 0 0 1 8
## 10 0 0 0 0 0 9
## 11 1 1 1 1 1 2
## 2 1 1 1 0 1 3
## 1 0 0 0 0 1 10
## 9 1 1 1 1 1 1
## 8 1 1 1 1 0 2
## 10 1 1 1 0 1 2
## 7 1 1 1 0 0 3
## 3 0 0 0 1 0 9
## 1 0 0 0 0 1 9
## 5 0 0 0 0 0 10
## 2 1 1 1 0 1 4
## 3 0 0 0 0 1 11
## 1 1 1 1 1 0 2
## 12 1 1 1 1 0 4
## 13 1 1 1 0 0 5
## 3 0 0 0 0 0 12
## 1 1 1 1 1 0 5
## 2 1 1 1 0 0 6
## 1 0 0 0 1 0 12
## 1 0 0 0 0 0 13
## 1 1 1 1 1 1 2
## 2 1 1 1 1 0 3
## 1 1 1 1 0 1 3
## 7 1 1 1 0 0 4
## 1 0 0 0 1 0 10
## 1 1 1 1 0 1 4
## 1 1 1 1 0 1 6
## 1 1 1 1 1 0 6
## 2 1 1 1 0 0 7
## 1 1 1 1 0 0 8
## 1 0 0 0 0 0 15
## 2 1 1 1 1 1 2
## 1 1 1 1 1 0 3
## 2 1 1 1 0 1 3
## 3 1 1 1 0 0 4
## 1 0 0 0 0 1 11
## 1 1 1 1 0 0 7
## 1 1 1 1 0 0 8
## 3 1 1 1 1 0 4
## 2 1 1 1 0 0 5
## 2 1 1 1 1 1 1
## 10 1 1 1 1 0 2
## 1 0 0 0 0 1 10
## 1 1 1 1 0 0 6
## 1 0 0 0 0 0 10
## 91 91 92 315 817 2075
md.pattern() from package mice gives us a matrix where each row represents one missing data pattern (1 = observed, 0 = missing). The rownames show how many rows of the dataset have the given pattern. The last column shows the number of missing values in each pattern, the last row the number of missing values per variable.
md.pattern() also plots the missing data pattern automatically.
## educ cohort age race DM gender smoke occup wgt hgt BMI hypten WC SBP DBP HDL hypchol chol
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0
## 6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0
## 7 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1
## 8 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1
## 9 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1
## 10 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1
## 11 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1
## 12 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0
## 13 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0
## 14 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1
## 15 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1
## 16 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1
## 17 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1
## 18 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0
## 19 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0
## 20 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1
## 21 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0
## 22 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0
## 23 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1
## 24 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1
## 25 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1
## 26 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1 1 1
## 27 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1
## 28 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1
## 29 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1
## 30 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 31 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1
## 32 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1
## 33 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1
## 34 1 1 1 1 1 1 1 1 1 0 0 1 0 1 1 1 1 1
## 35 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1
## 36 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 37 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
## 38 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1 1 1
## 39 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0
## 40 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1
## 41 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 42 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
## 43 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 0 0 0
## 44 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1 0 0 0
## 45 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1
## 46 1 1 1 1 1 1 1 1 0 1 0 0 1 0 0 1 1 1
## 47 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 0
## 48 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0
## 49 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 50 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 1 1 1
## 51 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 0 0 0
## 52 1 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1 1 1
## 53 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
## 54 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1
## 55 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1
## 56 1 1 1 1 1 1 1 0 1 1 1 0 1 0 0 1 1 1
## Nmis 0 0 0 0 0 0 1 14 16 24 35 42 60 61 61 88 88 88
## creat uricacid albu bili alc HyperMed Npat
## 1 1 1 1 1 1 0 534
## 2 1 1 1 1 0 0 149
## 3 1 1 1 1 0 1 69
## 4 1 1 1 1 1 1 33
## 5 0 0 0 0 1 0 26
## 6 0 0 0 0 1 1 20
## 7 1 1 1 1 0 0 13
## 8 1 1 1 1 1 0 12
## 9 1 1 1 1 1 1 11
## 10 1 1 1 1 1 0 10
## 11 1 1 1 1 0 1 10
## 12 0 0 0 0 0 0 10
## 13 0 0 0 0 0 1 9
## 14 1 1 1 1 1 1 9
## 15 1 1 1 1 1 0 8
## 16 1 1 1 1 0 0 7
## 17 1 1 1 1 0 0 7
## 18 0 0 0 0 0 0 5
## 19 0 0 0 0 0 1 3
## 20 1 1 1 1 1 0 3
## 21 0 0 0 0 1 0 3
## 22 0 0 0 0 0 0 3
## 23 1 1 1 1 0 0 3
## 24 1 1 1 1 0 1 2
## 25 1 1 1 1 0 0 2
## 26 1 1 1 1 0 0 2
## 27 1 1 1 1 0 0 2
## 28 1 1 1 1 0 1 2
## 29 1 1 1 1 1 0 2
## 30 0 0 0 0 1 0 2
## 31 1 1 1 1 0 1 2
## 32 1 1 1 1 1 1 2
## 33 1 1 1 1 1 1 2
## 34 1 1 1 1 0 1 1
## 35 1 1 1 1 1 0 1
## 36 1 1 1 1 1 0 1
## 37 0 0 0 0 0 0 1
## 38 1 1 1 1 1 0 1
## 39 0 0 0 0 0 1 1
## 40 1 1 1 1 1 0 1
## 41 1 1 1 0 1 0 1
## 42 0 0 0 0 0 0 1
## 43 0 0 0 0 0 1 1
## 44 0 0 0 0 0 1 1
## 45 1 1 1 1 0 1 1
## 46 1 1 1 1 0 0 1
## 47 0 0 0 0 0 0 1
## 48 0 0 0 0 0 1 1
## 49 0 0 0 0 0 1 1
## 50 1 1 1 1 0 0 1
## 51 0 0 0 0 1 0 1
## 52 1 1 1 1 0 1 1
## 53 0 0 0 0 1 0 1
## 54 1 1 1 1 1 1 1
## 55 1 1 1 1 0 0 1
## 56 1 1 1 1 0 0 1
## Nmis 91 91 91 92 315 817 2075
The function md_pattern() from package JointAI gives a matrix very similar to the one obtained from mice::md.pattern(). However, here the number of rows in the data that have a particular missing data pattern is given in the last column.
For more information on how to customize the visualization by md_pattern() see the vignette Visualizing Incomplete Data on the webpage of JointAI.
aggr() from VIM plots a histogram of the proportion of missing values per column as well as a visualization of the missing data pattern. Here, a small histogram on the right edge of the plot shows the proportion of cases in each pattern.
For more options how to customize the plot, see the VIM documentation.
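For instance, the plot method for aggr() accepts options such as numbers, prop and sortVars (a sketch; see ?VIM::aggr for the full list):

```r
# show the proportions as numbers and sort variables
# by their amount of missingness
VIM::aggr(NHANES, numbers = TRUE, prop = TRUE,
          sortVars = TRUE, cex.axis = 0.6)
```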
Now get an overview of the number and proportion of complete cases, and of the number and proportion of missing values per variable. The following functions are useful here:
complete.cases() gives you a vector that is TRUE if the case is complete and FALSE if there are missing values.
is.na() returns TRUE if the value is missing and FALSE if the value is observed.
colSums() calculates the sum of the values in each column of a data.frame or matrix.
colMeans() calculates the mean of the values in each column of a data.frame or matrix.
# number and proportion of complete cases
cc <- ifelse(complete.cases(NHANES), 'complete', 'incomplete')
cbind("#" = table(cc),
      "%" = round(100 * table(cc)/nrow(NHANES), 2))
##              #    %
## complete    33  3.3
## incomplete 967 96.7
# number and proportion of missing values per variable
cbind("# NA" = sort(colSums(is.na(NHANES))),
      "% NA" = round(sort(colMeans(is.na(NHANES))) * 100, 2))
##          # NA % NA
## educ 0 0.0
## cohort 0 0.0
## age 0 0.0
## race 0 0.0
## DM 0 0.0
## gender 0 0.0
## smoke 1 0.1
## occup 14 1.4
## wgt 16 1.6
## hgt 24 2.4
## BMI 35 3.5
## hypten 42 4.2
## WC 60 6.0
## SBP 61 6.1
## DBP 61 6.1
## HDL 88 8.8
## hypchol 88 8.8
## chol 88 8.8
## creat 91 9.1
## uricacid 91 9.1
## albu 91 9.1
## bili 92 9.2
## alc 315 31.5
## HyperMed 817 81.7
In the missing data pattern we could already see that some variables tend to be missing together. But there may be different types of relationships between variables that are of interest.
Our data contains hgt (height), wgt (weight) and BMI. Check the missing data pattern specifically for those three variables.
# Three solutions to choose from:
JointAI::md_pattern(NHANES[, c("hgt", "wgt", "BMI")], pattern = TRUE, plot = FALSE)
##      wgt hgt BMI Npat
## 1 1 1 1 965
## 2 1 0 0 19
## 3 0 1 0 11
## 4 0 0 0 5
## Nmis 16 24 35 75
with(NHANES,
table(ifelse(is.na(hgt), 'height mis.', 'height obs.'),
ifelse(is.na(wgt), 'weight mis.', 'weight obs.'),
ifelse(is.na(BMI), 'BMI mis.', 'BMI obs.'))
)
## , ,  = BMI mis.
##
##
## weight mis. weight obs.
## height mis. 5 19
## height obs. 11 0
##
## , , = BMI obs.
##
##
## weight mis. weight obs.
## height mis. 0 0
## height obs. 0 965
plyr::ddply(NHANES, c(height = "ifelse(is.na(hgt), 'missing', 'observed')",
                      weight = "ifelse(is.na(wgt), 'missing', 'observed')",
                      BMI = "ifelse(is.na(BMI), 'missing', 'observed')"),
            plyr::summarize,
            N = length(hgt)
)
As we have already seen in the lecture, there are some cases in which only one of hgt and wgt is missing. BMI is only observed when both components are observed. To use all available information, we want to impute hgt and wgt separately and calculate BMI from the imputed values.
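In mice, this strategy is known as passive imputation. A sketch of the necessary setup (the seed value is arbitrary; the predictor matrix is adjusted so that the derived BMI does not feed back into its own components):

```r
library(mice)
ini <- mice(NHANES, maxit = 0)       # dry run to obtain default settings
meth <- ini$method
meth["BMI"] <- "~ I(wgt / hgt^2)"    # derive BMI from imputed wgt and hgt
pred <- ini$predictorMatrix
pred[c("hgt", "wgt"), "BMI"] <- 0    # avoid feedback from BMI
imp <- mice(NHANES, method = meth, predictorMatrix = pred, seed = 2020)
```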
The data contains variables on hypertension (hypten) and on the use of medication to treat hypertension (HyperMed). We might expect that medication is only prescribed for patients with hypertension; hence, we investigate the relationship between HyperMed and hypten.
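The cross-tabulation below can be obtained, for example, with:

```r
# contingency table of HyperMed and hypten, keeping missing values visible
with(NHANES, table(HyperMed, hypten, useNA = 'ifany'))
```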
## hypten
## HyperMed no yes <NA>
## no 0 25 0
## previous 0 14 0
## yes 0 144 0
## <NA> 717 58 42
Furthermore, we can expect a systematic relationship between the variables hypchol (Hypercholesterolemia) and chol (serum cholesterol). Find out how these two variables are related.
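One option is to summarize chol separately within the two levels of hypchol:

```r
# distribution of serum cholesterol for hypchol = 'no' and 'yes'
summary(NHANES$chol[NHANES$hypchol == 'no'])
summary(NHANES$chol[NHANES$hypchol == 'yes'])
```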
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.070 4.140 4.760 4.715 5.350 6.180 88
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 6.210 6.410 6.650 6.896 7.140 9.930 88
It seems that hypchol is defined as chol > 6.2, which makes hypchol dependent on chol.
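This can be checked directly by cross-tabulating hypchol against the apparent threshold:

```r
# hypchol should be 'yes' exactly when chol > 6.2
with(NHANES, table(hypchol, chol > 6.2, useNA = 'ifany'))
```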
Visualize the correlations between variables to detect associations that you may have missed.
# convert all variables to numeric
dat_num <- sapply(NHANES, as.numeric)
cormat <- cor(dat_num, use = 'pair', method = 'spearman')
corrplot::corrplot(cormat, method = 'square', type = 'lower')
Note: Correlations involving categorical variables are not usually computed! We only use this as a quick-and-dirty method to visualize relationships.
The question marks in the plot indicate that no correlation could be calculated for cohort (because cohort has the same value for all subjects), nor between HyperMed and hypten (but we already knew that).
The plot shows that besides the relations we were already aware of, wgt is strongly correlated with WC. Comparing the number of cases where only either wgt or WC is missing shows that there are 14 cases where wgt is missing but WC is observed and 58 cases where WC is missing and wgt is observed.
with(NHANES, table(ifelse(is.na(WC), 'WC mis.', 'WC obs.'),
                   ifelse(is.na(wgt), 'wgt mis.', 'wgt obs.')
))
##
## wgt mis. wgt obs.
## WC mis. 2 58
## WC obs. 14 926
Before imputing missing values it is important to take a look at how the observed parts of incomplete variables are distributed, so that we can choose appropriate imputation models.
Visualize the distributions of the incomplete continuous and categorical variables. The package JointAI provides the convenient function plot_all() for this.
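A minimal call (a sketch; plot_all() chooses a histogram or barplot depending on the class of each variable):

```r
library(JointAI)
plot_all(NHANES)
```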
Pay attention to the shape of the distributions of the continuous variables and to the frequencies of the categories of the categorical variables.
To learn more about additional options of plot_all() check out the vignette Visualizing Incomplete Data on the webpage of JointAI.
© Nicole Erler